3D Object Representations¶

In the realm of 3D computer vision, solid objects are typically represented using one of two main categories:

  • Boundary Representations: These methods describe an object by the surface that bounds it. They include:

    • Point Clouds: A collection of points in 3D space, each sampling a location on the object's surface.
    • Meshes: Composed of vertices, edges, and faces that define the shape of a 3D object. Meshes provide a more detailed representation than point clouds, as they include information on how these points are connected.
    • Parametric Surfaces: Surfaces defined by parametric equations. They offer a smooth representation and can be used to model complex shapes with fewer data points than meshes.
  • Space-partitioning Representations: These methods divide the 3D space into discrete segments to represent the volume of objects. They include:

    • Voxels: The 3D equivalent of pixels. A voxel represents a value on a regular grid in three-dimensional space. Voxels are particularly useful for representing volumetric data.
    • Implicit Surfaces: Defined by a function that specifies whether a point in space is inside or outside the object. These surfaces are powerful for modeling smooth, continuous surfaces and complex topologies.
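As a tiny illustration of an implicit surface, the signed distance function (SDF) of a sphere classifies any point as inside, on, or outside the object. This is a minimal NumPy sketch; the function name is my own:

```python
import numpy as np

def sphere_sdf(points, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the
    surface, positive outside."""
    return np.linalg.norm(points - np.asarray(center), axis=-1) - radius

pts = np.array([[0.0, 0.0, 0.0],   # center -> inside
                [1.0, 0.0, 0.0],   # on the surface
                [2.0, 0.0, 0.0]])  # outside
print(sphere_sdf(pts))  # [-1.  0.  1.]
```

The zero level set of such a function is the surface itself, which is why implicit representations handle complex topologies gracefully.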

This tutorial aims to explore the rendering of voxels, meshes, and point clouds using Python packages, with a primary focus on PyTorch3D. PyTorch3D is a versatile library designed to support 3D data processing tasks, offering efficient implementations for various 3D representations.

TODO: Add Parametric Surfaces and Implicit Surfaces into this tutorial.

In [ ]:
from pathlib import Path
import functools
from pprint import pprint
import gc

import imageio
import laspy
import matplotlib.pyplot as plt
import numpy as np
import torch

from pytorch3d.io import load_obj, load_objs_as_meshes
from pytorch3d.renderer.blending import BlendParams
from pytorch3d.renderer.lighting import PointLights
from pytorch3d.renderer.materials import Materials
from pytorch3d.renderer.mesh.renderer import MeshRenderer
from pytorch3d.renderer.mesh.rasterizer import MeshRasterizer, RasterizationSettings
from pytorch3d.renderer.mesh.textures import TexturesVertex, Textures
from pytorch3d.renderer.mesh.shader import SoftPhongShader
from pytorch3d.renderer.points.renderer import PointsRenderer
from pytorch3d.renderer.points.rasterizer import (
    PointsRasterizer,
    PointsRasterizationSettings,
)
from pytorch3d.renderer.points.compositor import AlphaCompositor
from pytorch3d.renderer.implicit.renderer import VolumeRenderer
from pytorch3d.renderer.implicit.raysampling import NDCMultinomialRaysampler
from pytorch3d.renderer.implicit.raymarching import EmissionAbsorptionRaymarcher
from pytorch3d.renderer.cameras import (
    look_at_view_transform,
    look_at_rotation,
    FoVOrthographicCameras,
    FoVPerspectiveCameras,
)
from pytorch3d.structures import Meshes, Pointclouds, Volumes

Device Selection: It's preferable to use a GPU for running this notebook. With GPU acceleration, each block can complete in tens of seconds or around one to two minutes at most. However, if a CPU is used, it may take three to five times longer to run.

In [ ]:
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    torch.cuda.set_device(device)
else:
    device = torch.device("cpu")

print("Using device: ", device)

DATA_DIR = Path("./data")
OUTPUT_DIR = Path("./output")
Using device:  cuda:0

Coordinate Systems in PyTorch3D¶

PyTorch3D works with four primary coordinate systems to manage 3D objects and camera perspectives. These coordinate systems are:

  1. World Coordinates: This system defines the position and orientation of objects in a 3D scene globally. It is the universal frame of reference, where each object's location is specified relative to a common origin point.

  2. Camera Coordinates (View Coordinates): Camera coordinates position objects relative to the camera's viewpoint. In this system, objects are transformed from the world coordinates to a viewpoint where the camera is at the origin, aiming along a specific axis (commonly the negative z-axis). This transformation facilitates the rendering process by aligning objects according to the camera's perspective.

  3. NDC (Normalized Device Coordinates): After defining objects in the camera's perspective, the coordinates are normalized to a unit cube before rasterization. The NDC system ensures that coordinates are within a standard range (typically from -1 to 1 in all three axes), making it easier to map them to the 2D screen space.

  4. Screen Coordinates (Image Coordinates): The final transformation maps NDC to the actual pixel coordinates on the screen or viewport. This step adjusts for the aspect ratio and dimensions of the output medium, translating and scaling the normalized coordinates to fit the display window or image.
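The final NDC-to-screen step can be sketched as a small helper (a generic convention with the origin at the top-left and y increasing downward; PyTorch3D's own sign conventions differ in some details, so treat this as an illustration only):

```python
def ndc_to_screen(x_ndc, y_ndc, width, height):
    """Map NDC coordinates in [-1, 1] to pixel coordinates, with
    (0, 0) at the top-left corner of the image."""
    x_screen = (x_ndc + 1.0) * 0.5 * width
    y_screen = (1.0 - y_ndc) * 0.5 * height
    return x_screen, y_screen

print(ndc_to_screen(0.0, 0.0, 512, 512))  # (256.0, 256.0) — image center
```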

Camera Models in PyTorch3D¶

PyTorch3D offers four camera models to simulate various photographic and viewing effects. The available camera models are:

  1. FoVPerspectiveCameras: This model simulates a field of view (FoV) perspective camera, which mimics the way human eyes and traditional cameras perceive depth. Objects closer to the camera appear larger than those further away, creating a realistic sense of depth. The field of view can be adjusted to widen or narrow the camera's visible scene.

  2. FoVOrthographicCameras: Combining field of view settings with an orthographic projection, this camera model removes perspective distortion. It is useful for cases where maintaining the true dimensions of objects regardless of depth is important, such as in architectural visualization.

  3. PerspectiveCameras: A standard perspective camera model without explicit field of view parameters. Instead, it uses focal length and principal point parameters to define the camera's perspective projection. This model is versatile for general 3D rendering tasks where realistic depth perception is desired.

  4. OrthographicCameras: This model projects 3D objects onto the 2D screen without perspective distortion, meaning objects retain their size regardless of their distance from the camera. It is ideal for technical and engineering applications where accurate dimensions are critical.
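The core difference between the perspective and orthographic families can be seen by projecting a single point. This is a minimal sketch, not PyTorch3D's actual camera code; `focal` is an illustrative parameter:

```python
def project_perspective(p, focal=1.0):
    # Perspective: divide by depth, so farther points appear smaller.
    x, y, z = p
    return focal * x / z, focal * y / z

def project_orthographic(p):
    # Orthographic: drop the depth coordinate; size is depth-independent.
    x, y, z = p
    return x, y

near, far = (1.0, 1.0, 2.0), (1.0, 1.0, 4.0)
print(project_perspective(near), project_perspective(far))    # (0.5, 0.5) (0.25, 0.25)
print(project_orthographic(near), project_orthographic(far))  # (1.0, 1.0) (1.0, 1.0)
```

The same world point projects to different image positions under perspective depending on its depth, but to the same position under orthographic projection.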

TODO: Read up on the fisheye camera model and implement fisheye intrinsic parameters in the future.

Meshes Representation¶

Meshes describe the outer shape of a 3D object as a collection of 2D polygonal faces, composed of either triangles or quadrilaterals. Due to their simplicity and versatility, triangular meshes are the most commonly used in 3D computer vision and graphics. A mesh is primarily composed of two components: the coordinates of its vertices and, for each triangle, the indices of the three vertices that form it.

Each triangle in a mesh is defined by three vertices, and these vertices are identified by their positions in 3D space, typically represented as (x, y, z) coordinates. The indices of these vertices are used to uniquely identify each triangle, facilitating efficient storage and rendering by graphics hardware.
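A minimal example of these two components — vertex coordinates plus face indices — is a unit square built from two triangles (a NumPy sketch; PyTorch3D's `Meshes` structure stores the same information as tensors):

```python
import numpy as np

# A unit square in the XY plane built from two triangles.
vertices = np.array([
    [0.0, 0.0, 0.0],  # 0: bottom-left
    [1.0, 0.0, 0.0],  # 1: bottom-right
    [1.0, 1.0, 0.0],  # 2: top-right
    [0.0, 1.0, 0.0],  # 3: top-left
])
# Each face lists three vertex indices (counter-clockwise winding).
faces = np.array([
    [0, 1, 2],
    [0, 2, 3],
])
print(vertices.shape, faces.shape)  # (4, 3) (2, 3)
```

Sharing vertices 0 and 2 between the two faces is what makes the index-based layout more compact than storing each triangle's coordinates separately.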

Textures:

Textures can significantly enhance the visual appearance of meshes. By mapping a 2D image onto the surface of a 3D model, textures add details such as colors, patterns, and real-world appearance to the object without increasing the complexity of the mesh. This process involves defining UV coordinates for each vertex, which correspond to specific points on the 2D texture image. The rendering system then interpolates these coordinates across the surface of the triangles to apply the texture.
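The interpolation step can be sketched with barycentric weights (a hypothetical helper for illustration, not a PyTorch3D API):

```python
import numpy as np

def interpolate_uv(uv_per_vertex, bary):
    """Interpolate per-vertex UV coordinates at a point inside a
    triangle, given its barycentric weights (which sum to 1)."""
    return np.asarray(bary) @ np.asarray(uv_per_vertex)

uvs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]        # UVs at the three vertices
center = interpolate_uv(uvs, [1/3, 1/3, 1/3])     # triangle centroid -> ~(1/3, 1/3)
print(center)
```

The resulting (u, v) pair is then used to look up a texel in the 2D texture image.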

Additional Features:

Beyond the basic structure and texturing, meshes can include several additional features that enhance their realism and utility:

  • Normal Vectors: Each vertex or face in a mesh can have a normal vector, which is a unit vector pointing perpendicular to the surface. Normal vectors are crucial for lighting calculations, as they determine how light reflects off surfaces, affecting the appearance of the mesh under different lighting conditions.

  • Vertex Colors: While textures provide detailed surface appearance, vertex colors allow for the coloring of individual vertices. The colors of the vertices are then interpolated across the faces of the mesh, enabling a gradient effect or simple color variations without the need for a texture.
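For the first bullet, a face normal can be computed from the cross product of two edge vectors (a small NumPy sketch, not a PyTorch3D function):

```python
import numpy as np

def face_normal(v0, v1, v2):
    """Unit normal of a triangle, via the cross product of two edges.
    The direction follows the right-hand rule from the vertex order."""
    n = np.cross(np.subtract(v1, v0), np.subtract(v2, v0))
    return n / np.linalg.norm(n)

# A triangle lying in the XY plane has normal +Z.
print(face_normal([0, 0, 0], [1, 0, 0], [0, 1, 0]))  # [0. 0. 1.]
```

Per-vertex normals are typically obtained by averaging the normals of the faces adjacent to each vertex.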

In this tutorial, the ./data/IronMan folder contains the following files:

IronMan.mtl
IronMan.obj
v1_0_IronManRigged.max

Blender can import the IronMan.obj file and render the Iron Man model pretty well as shown in the image below. However, I haven't figured out how to integrate the colors using PyTorch3D.


The Iron Man

Upon importing `IronMan.obj` into Blender, the software correctly uses the textures from the associated files, showing a detailed 3D representation of Iron Man. This mesh is adapted from "https://free3d.com/".

TODO: Incorporate the native textures of the "IronMan.obj" into rendering with PyTorch3D.

In [ ]:
mesh_filename = DATA_DIR / "IronMan/IronMan.obj"

vertices, faces, aux = load_obj(
    f=mesh_filename,
    load_textures=True,
    create_texture_atlas=True,
    texture_atlas_size=4,
    texture_wrap="repeat",
    device=device,
)
print("vertices: ", type(vertices), vertices.shape)
print("faces: ", type(faces))
pprint(dir(faces))

# TODO: Figure out how to use the following textures to render prettier Iron Man.
print("aux.normals: ", aux.normals.shape)
print("aux.verts_uvs: ", aux.verts_uvs.shape)
print("aux.material_colors: ")
pprint(aux.material_colors)
vertices:  <class 'torch.Tensor'> torch.Size([129759, 3])
faces:  <class 'pytorch3d.io.obj_io.Faces'>
['__add__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__match_args__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '_asdict',
 '_field_defaults',
 '_fields',
 '_make',
 '_replace',
 'count',
 'index',
 'materials_idx',
 'normals_idx',
 'textures_idx',
 'verts_idx']
aux.normals:  torch.Size([129759, 3])
aux.verts_uvs:  torch.Size([10865, 2])
aux.material_colors: 
{'14_-_Default': {'ambient_color': tensor([0.0000, 0.2000, 0.9000], device='cuda:0'),
                  'diffuse_color': tensor([0.0000, 0.2000, 0.9000], device='cuda:0'),
                  'shininess': tensor([0.], device='cuda:0'),
                  'specular_color': tensor([0.9000, 0.9000, 0.9000], device='cuda:0')},
 'black': {'ambient_color': tensor([0., 0., 0.], device='cuda:0'),
           'diffuse_color': tensor([0.8000, 0.8000, 0.8000], device='cuda:0'),
           'shininess': tensor([10.], device='cuda:0'),
           'specular_color': tensor([0.2000, 0.2000, 0.2000], device='cuda:0')},
 'darksilver': {'ambient_color': tensor([0., 0., 0.], device='cuda:0'),
                'diffuse_color': tensor([0.8000, 0.8000, 0.8000], device='cuda:0'),
                'shininess': tensor([10.], device='cuda:0'),
                'specular_color': tensor([0.2000, 0.2000, 0.2000], device='cuda:0')},
 'gold': {'ambient_color': tensor([0., 0., 0.], device='cuda:0'),
          'diffuse_color': tensor([0.6000, 0.5000, 0.1000], device='cuda:0'),
          'shininess': tensor([10.], device='cuda:0'),
          'specular_color': tensor([0.5000, 0.5000, 0.5000], device='cuda:0')},
 'lambert1': {'ambient_color': tensor([0.4000, 0.4000, 0.4000], device='cuda:0'),
              'diffuse_color': tensor([0.4000, 0.4000, 0.4000], device='cuda:0'),
              'shininess': tensor([4.8000], device='cuda:0'),
              'specular_color': tensor([1., 1., 1.], device='cuda:0')},
 'red': {'ambient_color': tensor([0., 0., 0.], device='cuda:0'),
         'diffuse_color': tensor([0.4000, 0.1000, 0.1000], device='cuda:0'),
         'shininess': tensor([10.], device='cuda:0'),
         'specular_color': tensor([0.5000, 0.5000, 0.5000], device='cuda:0')},
 'silver': {'ambient_color': tensor([0., 0., 0.], device='cuda:0'),
            'diffuse_color': tensor([0.8000, 0.8000, 0.8000], device='cuda:0'),
            'shininess': tensor([10.], device='cuda:0'),
            'specular_color': tensor([0.2000, 0.2000, 0.2000], device='cuda:0')},
 'yellow': {'ambient_color': tensor([1., 1., 1.], device='cuda:0'),
            'diffuse_color': tensor([1., 1., 1.], device='cuda:0'),
            'shininess': tensor([10.], device='cuda:0'),
            'specular_color': tensor([0.2000, 0.2000, 0.2000], device='cuda:0')}}
/home/lai/deep_computer_vision/venv_3d_cv/lib/python3.10/site-packages/pytorch3d/io/mtl_io.py:102: UserWarning: Texture UV coordinates outside the range [0, 1].                 The integer part will be ignored to form a repeating pattern.
  warnings.warn(msg)

Creating Textures for the Iron Man Model¶

To address the difficulty mentioned previously, I will manually create artificial textures for the Iron Man model. I have selected a reddish color, represented by the RGB values [1.0, 0.5, 0.5].

In [ ]:
# TODO: Use pytorch3d.renderer.mesh.textures.Textures to construct colorful textures.
# Read https://pytorch3d.readthedocs.io/en/v0.6.0/notes/meshes_io.html?highlight=Meshes
color = [1.0, 0.5, 0.5]
textures = torch.tensor(color).to(device) * torch.ones_like(vertices)  # (num_vertices, 3)

Creating a Meshes Object and Rescaling the Mesh¶

We are now ready to create a Meshes object. When rendering an object of unknown size, it is highly convenient to initially reposition the object's center to (0, 0, 0) and adjust its scale so that its dimensions fall within the range of [-1, 1]. This standardization simplifies subsequent rendering and manipulation tasks by ensuring uniformity across different models.

In [ ]:
mesh = Meshes(
    verts=[vertices],
    faces=[faces.verts_idx],
    textures=TexturesVertex([textures]),
)

verts = mesh.verts_packed()
print(f"The shape of the packed vertices is {verts.shape}")

center = verts.mean(dim=0)
scale = max((verts - center).abs().max(0)[0])

print("\nBefore moving and rescaling:")
print("Center: ", center)
print("Scale: ", verts.min(), "to", verts.max())

mesh.offset_verts_(-center)
mesh.scale_verts_((1.0 / float(scale)))

print("\nAfter moving and rescaling:")
verts = mesh.verts_packed()
print("Center: ", verts.mean(dim=0))
print("Scale: ", verts.min(), "to", verts.max())
The shape of the packed vertices is torch.Size([129759, 3])

Before moving and rescaling:
Center:  tensor([-1.1979e-01,  1.5489e+02, -8.4590e+00], device='cuda:0')
Scale:  tensor(-130.1906, device='cuda:0') to tensor(254.2135, device='cuda:0')

After moving and rescaling:
Center:  tensor([ 0.0000e+00,  8.5138e-08, -1.4699e-09], device='cuda:0')
Scale:  tensor(-1., device='cuda:0') to tensor(0.8192, device='cuda:0')

Configuring Settings for Mesh Rendering¶

This section provides an overview of the settings required to render a Meshes object effectively:

  • RasterizationSettings: The faces_per_pixel parameter determines the maximum number of faces that can occupy a single pixel in the rasterized image. Increasing this value allows for more detailed visualization of the meshes.

  • PointLights: I introduce a point light into the scene at the coordinates [0.0, 3.0, 10.0]. It's important to note that currently, PyTorch3D only supports lighting configurations for mesh rendering.

  • BlendParams: Controls how overlapping faces are blended during soft rasterization (via sigma and gamma); here it is mainly used to set the background color.

  • Materials: Defines surface reflectance properties — ambient, diffuse, and specular colors, plus a shininess exponent — that the shader uses in its lighting calculations.

In [ ]:
image_size = 512
raster_settings = RasterizationSettings(image_size=image_size, blur_radius=0.0, faces_per_pixel=5)
lights = PointLights(location=[[0.0, 3.0, 10.0]], device=device)
blend_params = BlendParams(sigma=1e-4, gamma=1e-4, background_color=(0.5, 0.5, 0.5))
materials = Materials(device=device, specular_color=[[1.0, 0.0, 0.0]], shininess=10.0)

Configuring Camera Models¶

I first choose FoVPerspectiveCameras to render the scene.

Generating Extrinsic Parameters

Two primary functions facilitate the generation of extrinsic parameters, essential for transforming points from world coordinates to camera coordinates:

  • look_at_view_transform function calculates the extrinsic parameters (rotation and translation) needed for this transformation. It simulates the movement of a camera within the world frame, ensuring it points towards the world's origin. The function parameters, distance from the world's origin (dist), elevation angle above the XZ plane (elev), and azimuth angle relative to the +Z direction (azim), define the camera's position and orientation in the world frame.

  • look_at_rotation function, on the other hand, provides a more intuitive approach to setting the camera's extrinsic parameters. It requires the camera_position vector, indicating the camera's location in world coordinates, and two vectors, at and up, which represent the position of the object and the world coordinate system's upward direction, respectively. Given that we have already moved and rescaled the object to be near the world's origin, using look_at_rotation might offer a clearer method for specifying the camera's position relative to the scene.

Important Note: There are two things I don't quite understand yet. First, why doesn't look_at_rotation return a translation vector the way look_at_view_transform does? I reuse the translation_vector generated by look_at_view_transform to form the extrinsic parameters for look_at_rotation. Second, the translation_vector returned by look_at_view_transform remains the same for various azimuth values (0, 30, and 180 degrees).

In [ ]:
azimuths = [0, 30, 180]
cam_pos = [(0, 0, 3), (2.12, 0, 2.12), (1.73, 1.73, 1.73)]
view_distance = 3

world_to_camera = []

for azim in azimuths:
    R, translation_vector = look_at_view_transform(
        dist=view_distance, elev=0, azim=azim
    )
    print("\nlook_at_view_transform rotation: \n", R)
    print("look_at_view_transform translation: ", translation_vector)
    world_to_camera.append((R, translation_vector))

for pos in cam_pos:
    p = torch.from_numpy(np.array(pos)).unsqueeze(0).float()
    R = look_at_rotation(camera_position=p, at=((0, 0, 0),), up=((0, 1, 0),))
    print("\nlook_at_rotation rotation: \n", R)
    # look_at_rotation only returns a rotation matrix, so reuse the last
    # translation_vector produced by look_at_view_transform above.
    world_to_camera.append((R, translation_vector))

cameras = []
for r, t in world_to_camera:
    fov_persp_cam = FoVPerspectiveCameras(
        znear=1.0,
        zfar=10,
        aspect_ratio=1.0,
        fov=45.0,
        R=r,
        T=t,
        degrees=True,
        device=device,
    )
    cameras.append(fov_persp_cam)
look_at_view_transform rotation: 
 tensor([[[-1.,  0.,  0.],
         [ 0.,  1.,  0.],
         [ 0.,  0., -1.]]])
look_at_view_transform translation:  tensor([[-0., -0., 3.]])

look_at_view_transform rotation: 
 tensor([[[-0.8660,  0.0000, -0.5000],
         [ 0.0000,  1.0000,  0.0000],
         [ 0.5000,  0.0000, -0.8660]]])
look_at_view_transform translation:  tensor([[-0., -0., 3.]])

look_at_view_transform rotation: 
 tensor([[[ 1.0000e+00, -0.0000e+00,  8.7423e-08],
         [ 0.0000e+00,  1.0000e+00,  0.0000e+00],
         [-8.7423e-08,  0.0000e+00,  1.0000e+00]]])
look_at_view_transform translation:  tensor([[-0., -0., 3.]])

look_at_rotation rotation: 
 tensor([[[-1.,  0.,  0.],
         [ 0.,  1.,  0.],
         [ 0.,  0., -1.]]])

look_at_rotation rotation: 
 tensor([[[-0.7071,  0.0000, -0.7071],
         [ 0.0000,  1.0000,  0.0000],
         [ 0.7071,  0.0000, -0.7071]]])

look_at_rotation rotation: 
 tensor([[[-0.7071, -0.4082, -0.5774],
         [ 0.0000,  0.8165, -0.5774],
         [ 0.7071, -0.4082, -0.5774]]])

Constructing the Mesh Renderer¶

The MeshRenderer can be assembled from the previously defined components. It can batch-process multiple camera models simultaneously; however, doing so significantly increases GPU memory consumption. To manage resources efficiently, I recommend rendering with a single camera model at a time, iterating over the cameras.

Additionally, since MeshRenderer inherits from torch.nn.Module, rendering a mesh and generating the corresponding image can be accomplished by calling renderer(). This function internally invokes the forward method to perform the inference operation.

In [ ]:
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))

for i, azim in enumerate(azimuths):
    renderer = MeshRenderer(
        rasterizer=MeshRasterizer(
            cameras=cameras[i + 0], raster_settings=raster_settings
        ),
        shader=SoftPhongShader(
            device=device,
            cameras=cameras[i + 0],
            lights=lights,
            blend_params=blend_params,
            materials=materials,
        ),
    )
    image = renderer(mesh)
    img = image[0, ..., :3].cpu().numpy()
    print(f"Rendered image's shape: {img.shape}.")
    ax[0, i].set_title(f"look_at_view_transform. ele: 0, azim: {azim}", fontsize=12)
    ax[0, i].imshow(img)
    ax[0, i].axis("off")

for i, pos in enumerate(cam_pos):
    renderer = MeshRenderer(
        rasterizer=MeshRasterizer(
            cameras=cameras[i + 3], raster_settings=raster_settings
        ),
        shader=SoftPhongShader(
            device=device,
            cameras=cameras[i + 3],
            lights=lights,
            blend_params=blend_params,
            materials=materials,
        ),
    )
    image = renderer(mesh)
    img = image[0, ..., :3].cpu().numpy()
    print(f"Rendered image's shape: {img.shape}.")
    ax[1, i].set_title(f"look_at_rotation. camera in world: {pos}", fontsize=12)
    ax[1, i].imshow(img)
    ax[1, i].axis("off")

fig.tight_layout()
plt.show()
Rendered image's shape: (512, 512, 3).
Rendered image's shape: (512, 512, 3).
Rendered image's shape: (512, 512, 3).
Rendered image's shape: (512, 512, 3).
Rendered image's shape: (512, 512, 3).
Rendered image's shape: (512, 512, 3).

Discussion:

From the rendering outcomes, considering the point light's position at (0.0, 3.0, 10.0) in world coordinates, I can deduce the following:

  • The top-left image shows the camera is looking at Iron Man's front side. Given that the point light originates from the +Z direction and the camera also captures this front view, it suggests Iron Man is oriented towards +Z, with +Z extending outward from the screen. This orientation implies the camera is facing the -Z direction.

  • The top-middle image indicates an azimuth of 30 degrees, which is the angle between the vector (from the origin to the camera) and the +Z axis.

  • The top-right image reveals that the camera is viewing the rear side of Iron Man, resulting in a darker appearance, because the camera is oriented towards the +Z direction.

  • Now, review the second-row images and evaluate whether the renderings align logically with the cameras' specified world positions, checking that each view is consistent with where its camera was placed.

Comparing Perspective and Orthographic Projections¶

In orthographic projection, lines that are parallel in the real world remain parallel after the transformation. This type of projection can be described through an affine transformation, which does not account for the perspective distortion seen in perspective projection. This makes orthographic projection particularly useful for technical and engineering drawings where accuracy of parallel lines is crucial.

For this comparison, I am using both FoVPerspectiveCameras and FoVOrthographicCameras, setting the azimuth to 0 and 180 degrees to observe the differences in how these projections render the same scene.

TODO: Expand the comparison to include PerspectiveCameras, OrthographicCameras, and a fisheye lens model to cover a broader range of camera models and their impact on 3D rendering.

In [ ]:
camera_type = [
    functools.partial(
        FoVPerspectiveCameras,
        znear=1.0,
        zfar=10,
        aspect_ratio=1.0,
        fov=60.0,
        degrees=True,
        device=device,
    ),
    functools.partial(
        FoVOrthographicCameras,
        znear=1.0,
        zfar=10,
        max_y=1.5,
        min_y=-1.5,
        max_x=1.5,
        min_x=-1.5,
        device=device,
    ),
]

view_distances = [2, 3]

cameras = []
world_to_view = []
for dist in view_distances:
    R, T = look_at_view_transform(dist=dist, elev=0, azim=0)
    cameras.append([cam_type(R=R, T=T) for cam_type in camera_type])
In [ ]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))

for i, dist in enumerate(view_distances):
    for j, cam_type in enumerate(["perspective", "orthographic"]):
        camera = cameras[i][j]
        renderer = MeshRenderer(
            rasterizer=MeshRasterizer(cameras=camera, raster_settings=raster_settings),
            shader=SoftPhongShader(
                device=device,
                cameras=camera,
                lights=lights,
                blend_params=blend_params,
                materials=materials,
            ),
        )
        image = renderer(mesh)
        img = image[0, ..., :3].cpu().numpy()
        print(f"Rendered image's shape: {img.shape}.")
        ax[i, j].set_title(f"{cam_type} projection. View distance: {dist}", fontsize=12)
        ax[i, j].imshow(img)
        ax[i, j].axis("off")

fig.tight_layout()
plt.show()
Rendered image's shape: (512, 512, 3).
Rendered image's shape: (512, 512, 3).
Rendered image's shape: (512, 512, 3).
Rendered image's shape: (512, 512, 3).

Discussion:

In perspective projection, the distance between the object and the observer is crucial; objects appear larger as the viewing distance decreases. This effect is due to the perspective projection mimicking the way the human eye perceives depth and distance, causing objects closer to the viewer to appear larger than those further away.

Conversely, in orthographic projection, the size of the object remains consistent regardless of the viewing distance. This projection does not account for depth in the same way, leading to a uniform representation of the object's size. Additionally, you can see that objects positioned behind Iron Man are more obscured in perspective projection than in orthographic projection.
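The discussion above can be summarized numerically: under a pinhole perspective model the apparent size of an object scales as focal / depth, while orthographic size is depth-independent. This is an illustrative helper, not PyTorch3D code:

```python
# Apparent size of an object of a given world size at a given depth.
def apparent_size(object_size, depth, focal=1.0, perspective=True):
    return focal * object_size / depth if perspective else object_size

for d in (2.0, 3.0):
    print(f"dist={d}: perspective={apparent_size(1.0, d):.3f}, "
          f"orthographic={apparent_size(1.0, d, perspective=False):.3f}")
```

Moving the camera from distance 3 to distance 2 enlarges the perspective image by a factor of 3/2, matching the difference visible between the two rendered rows.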

Rendering a Rotating Iron Man¶

Now, I will render a rotating Iron Man model under consistent lighting conditions. To achieve a brightly lit model throughout the rotation, it's essential to adjust the position of the light source to match the camera's movement. The concept here is to maintain the illumination on the Iron Man model as if the light source is always facing him, even as the camera angle changes.

Initially, when the azimuth is set to 0, the camera is oriented towards the -Z direction. Correspondingly, we'll position the light source in the -Z direction as well, at [0.0, 3.0, -10.0]. As Iron Man rotates, or as the camera's viewpoint changes, we'll apply a rotation matrix R to the light source. This matrix will mirror the rotation applied to Iron Man or the camera's movement around him, ensuring that the model remains well-lit from the camera's perspective at all times.

In [ ]:
batch_size = 120
light_init_pos = [0, 3, -10]
images = []

for azim in torch.linspace(0, 360, batch_size):
    R, T = look_at_view_transform(dist=2.7, elev=0, azim=azim)
    light_pos_rotate = np.dot(R, light_init_pos)
    light = PointLights(device=device, location=light_pos_rotate.tolist())
    cameras = FoVPerspectiveCameras(
        znear=1.0,
        zfar=3.0,
        aspect_ratio=1.0,
        fov=45.0,
        degrees=True,
        R=R,
        T=T,
        device=device,
    )
    renderer = MeshRenderer(
        rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
        shader=SoftPhongShader(
            device=device, cameras=cameras, lights=light, blend_params=blend_params
        ),
    )

    img = renderer(mesh, materials=materials)[0, ..., :3].cpu().numpy()
    img *= 255
    images.append(img.astype(np.uint8))

imageio.mimsave(OUTPUT_DIR / "ironman_mesh.gif", images, fps=12)

# NOTE: The following warning shows:
# Bin size was too small in the coarse rasterization phase. This caused an overflow,
# meaning output may be incomplete. To solve, try increasing max_faces_per_bin /
# max_points_per_bin, decreasing bin_size, or setting bin_size to 0 to use the naive
# rasterization.

Point Clouds Representation¶

Point clouds are a fundamental representation in computer vision and 3D modeling, capturing the essence of a physical space or object through a collection of points in a three-dimensional coordinate system. Each point in a point cloud is represented by a 3-vector $(x, y, z)$, indicating its position in space:

$$ \mathbf{v} \in \mathbb{R}^3: \quad \mathbf{v} = \begin{pmatrix} x \\ y \\ z \end{pmatrix} $$

Beyond mere coordinates, point clouds often encompass additional attributes that enrich the data. Commonly, these include RGB color information and normals, which represent the orientation of each point. Normals are crucial for understanding the surface geometry and are represented as another 3-vector in $\mathbb{R}^3$. The inclusion of RGB values adds visual detail to the point cloud, allowing for a more accurate and vivid reconstruction of the original scene or object:

$$ \text{RGB color: } \mathbf{c} = \begin{pmatrix} R \\ G \\ B \end{pmatrix}, \quad \text{normal: } \mathbf{n} = \begin{pmatrix} n_x \\ n_y \\ n_z \end{pmatrix} $$

Point clouds are inherently unordered, meaning there is no intrinsic sequence to the points. This characteristic poses unique challenges and considerations for processing and analyzing point cloud data.
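This unordered nature means a point cloud is a set, not a sequence: permuting the rows of the array changes nothing about the geometry it describes. A small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((5, 3))
shuffled = points[rng.permutation(len(points))]

# The arrays may differ element-wise, yet describe the same point set.
same_set = {tuple(p) for p in points} == {tuple(p) for p in shuffled}
print(same_set)  # True
```

This permutation invariance is why point-cloud networks such as PointNet use symmetric (order-independent) aggregation operations.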

In this section, I render a point cloud of Moszna Castle in Poland. An interesting aspect of this dataset is that the RGB color is encoded as np.uint16, which needs to be converted to a floating-point scale between 0 and 1. This conversion is done by dividing the RGB values by $2^{16} - 1$:

$$ \text{Normalized RGB: } \mathbf{c}_{\text{norm}} = \frac{\mathbf{c}}{2^{16} - 1} $$

Additionally, similar to mesh processing, we manipulate the point cloud to center it near the origin (0,0,0) and rescale its position. This normalization step facilitates easier manipulation, visualization, and further processing of the point cloud data, ensuring consistency in scale and orientation across different datasets.

In [ ]:
pc_file = DATA_DIR / "Palac_Moszna/Palac_Moszna.laz"

# Open the LAS/LAZ file
las = laspy.read(pc_file)
print(f"Read in *.las/*laz file as {type(las)}")
print(f"The shape of xyz coordinates of point cloud is {las.xyz.shape}.")
print(f"The range of R channel ({las.red.dtype}): {las.red.min()} to {las.red.max()}")
print(
    f"The range of G channel ({las.green.dtype}): {las.green.min()} to {las.green.max()}"
)
print(
    f"The range of B channel ({las.blue.dtype}): {las.blue.min()} to {las.blue.max()}"
)
assert (
    las.xyz.shape[0] == las.red.shape[0]
    and las.xyz.shape[0] == las.green.shape[0]
    and las.xyz.shape[0] == las.blue.shape[0]
)

# Read xyz as tensor
vertices = torch.Tensor(las.xyz).to(device)

print("\nBefore moving and rescaling points:")
center = vertices.mean(dim=0)
scale = max((vertices - center).abs().max(0)[0])
print("Center: ", center)
print("Scale: ", vertices.min(), "to", vertices.max())

vertices -= center
vertices /= float(scale)
print("\nAfter moving and rescaling points:")
print("Center: ", vertices.mean(dim=0))
print("Scale: ", vertices.min(), "to", vertices.max())


# Scale RGB to [0, 1] and read RGB as tensor
print("\n Scaling RGB:")
rgb = np.stack([las.red, las.green, las.blue], axis=1) / (2**16 - 1)
rgb = torch.Tensor(rgb).to(device)
print(f"The range of RGB is from {rgb.min()} to {rgb.max()}.")

# Create pointcloud object
pointcloud = Pointclouds(points=[vertices], features=[rgb])
Read in *.las/*.laz file as <class 'laspy.lasdata.LasData'>
The shape of xyz coordinates of point cloud is (5677589, 3).
The range of R channel (uint16): 0 to 65280
The range of G channel (uint16): 0 to 65280
The range of B channel (uint16): 0 to 65280

Before moving and rescaling points:
Center:  tensor([6.4835e+06, 5.5895e+06, 2.3260e+02], device='cuda:0')
Scale:  tensor(220.9218, device='cuda:0') to tensor(6483673., device='cuda:0')

After moving and rescaling points:
Center:  tensor([ 6.8168e-04, -1.7249e-03,  8.7367e-08], device='cuda:0')
Scale:  tensor(-0.9210, device='cuda:0') to tensor(1., device='cuda:0')

 Scaling RGB:
The range of RGB is from 0.0 to 0.9961089491844177.

Point Cloud Rendering Settings¶

When configuring the rendering settings for a point cloud, several key parameters play a crucial role in determining the visual quality and clarity of the rendered image.

  • image_size: The dimensions of the output image are set to 512x512 pixels.

  • radius: The radius parameter, expressed in Normalized Device Coordinates (NDC), sets the size of the disk that each point is rasterized as. The choice of radius affects the visual density and overlap of points in the rendered image: a larger radius produces more point overlap, giving a smoother appearance but potentially obscuring finer details.

  • points_per_pixel: This parameter specifies the maximum number of points considered for each pixel in the rendered image. Increasing points_per_pixel enhances the rendering's smoothness and reduces noise by allowing more points to contribute to the color and opacity of each pixel. However, higher values also increase the computational load and memory usage during rendering.

  • background_color: The background color is set to gray [0.5, 0.5, 0.5].
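Since the radius is given in NDC, where the square image spans 2 units across image_size pixels, a point's footprint in pixels can be estimated with a quick back-of-the-envelope calculation (a sketch assuming the square-image NDC convention):

```python
image_size = 512
radius_ndc = 0.005

# In NDC the image spans [-1, 1], i.e. 2 units across image_size pixels,
# so a point's radius in pixels is radius_ndc * image_size / 2.
radius_px = radius_ndc * image_size / 2
print(radius_px)  # 1.28
```

With these settings each point covers roughly a one-pixel disk, which is why 5.7 million points still render as a fairly dense surface.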

In [ ]:
image_size = 512

raster_settings = PointsRasterizationSettings(image_size=image_size, radius=0.005, points_per_pixel=150)
alpha_comp = AlphaCompositor(background_color=(0.5, 0.5, 0.5))

The First View of the Point Cloud¶

For the initial visualization of the point cloud, I employ FoVOrthographicCameras with an azimuth set to 0 degrees and different elevation values: 0, -60, -89, and -90 degrees.

In [ ]:
distance_view = 3
elevation = [0, -60, -89, -90] # Using -90 will flip the result.

fig, ax = plt.subplots(nrows=1, ncols=len(elevation), figsize=(20, 5))

for i, elev in enumerate(elevation):
    R, T = look_at_view_transform(dist=distance_view, elev=elev, azim=0)
    print(f"\nElevation: {elev} degrees.")
    print("R: ", R)
    print("T: ", T)
    cameras = FoVOrthographicCameras(
        R=R,
        T=T,
        min_x=-1,
        max_x=1,
        min_y=-1,
        max_y=1,
        znear=0.01,
        zfar=10,
        device=device,
    )
    rasterizer = PointsRasterizer(cameras=cameras, raster_settings=raster_settings)
    renderer = PointsRenderer(rasterizer=rasterizer, compositor=alpha_comp)

    images = renderer(pointcloud)
    img = images[0, ..., :3].cpu().numpy()
    print(f"The image's shape: {img.shape}, type: {img.dtype}, min: {img.min()}, max:{img.max()}")
    ax[i].set_title(f"Elevation: {elev} deg", fontsize=12)
    ax[i].imshow(img)
    ax[i].axis("off")

fig.tight_layout()
plt.show()
Elevation: 0 degrees.
R:  tensor([[[-1.,  0.,  0.],
         [ 0.,  1.,  0.],
         [ 0.,  0., -1.]]])
T:  tensor([[-0., -0., 3.]])
The image's shape: (512, 512, 3), type: float32, min: 0.00021912212832830846, max:0.9961090683937073

Elevation: -60 degrees.
R:  tensor([[[-1.0000,  0.0000,  0.0000],
         [ 0.0000,  0.5000,  0.8660],
         [ 0.0000,  0.8660, -0.5000]]])
T:  tensor([[-0., -0., 3.]])
The image's shape: (512, 512, 3), type: float32, min: 5.8950823586201295e-05, max:0.9961090683937073

Elevation: -89 degrees.
R:  tensor([[[-1.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0175,  0.9998],
         [ 0.0000,  0.9998, -0.0175]]])
T:  tensor([[-0.0000e+00, -3.7253e-09,  3.0000e+00]])
The image's shape: (512, 512, 3), type: float32, min: 3.6319339415058494e-05, max:0.9906490445137024

Elevation: -90 degrees.
R:  tensor([[[ 1.0000e+00,  0.0000e+00,  0.0000e+00],
         [-0.0000e+00,  4.3711e-08,  1.0000e+00],
         [ 0.0000e+00, -1.0000e+00,  4.3711e-08]]])
T:  tensor([[-0., -0., 3.]])
The image's shape: (512, 512, 3), type: float32, min: 7.869190449127927e-05, max:0.996108889579773
(Rendered output image)

Discussion:

The leftmost image, where the camera points towards the -Z direction, offers a top-down perspective of the castle. This setup indicates that the vector from the castle's base to its apex (the highest point) is aligned with the +Z direction. Viewing the castle from this angle mirrors the traditional topographic and aerial views, providing a comprehensive overlook from the zenith.

The transition in the camera's elevation from -89 to -90 degrees produces an upside-down rendering; I have not yet pinned down exactly why.
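One plausible explanation, sketched below with plain NumPy (a simplified look-at construction, not PyTorch3D's actual code), is that at elev = -90 the viewing direction becomes parallel to the default up vector (0, 1, 0), so the cross product that builds the camera's x-axis collapses to zero and a fallback orientation takes over:

```python
import numpy as np

up = np.array([0.0, 1.0, 0.0])  # the default up vector
dist = 3.0

norms = {}
for elev_deg in (-89, -90):
    e = np.deg2rad(elev_deg)
    cam_pos = dist * np.array([0.0, np.sin(e), np.cos(e)])  # azim = 0
    view_dir = -cam_pos / np.linalg.norm(cam_pos)           # camera looks at the origin
    # The camera's x-axis comes from cross(up, view_dir); when the view
    # direction is parallel to "up", this cross product collapses.
    norms[elev_deg] = np.linalg.norm(np.cross(up, view_dir))
    print(elev_deg, norms[elev_deg])
```

At -89 degrees the cross product is small but well defined; at exactly -90 degrees it vanishes, so the camera orientation is no longer determined by the up vector alone.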

In [ ]:
def rotation_z(theta):  # Rotation around the z-axis
    theta = np.deg2rad(theta)
    return np.array(
        [
            [np.cos(theta), -np.sin(theta), 0],
            [np.sin(theta), np.cos(theta), 0],
            [0, 0, 1],
        ]
    )


def rotation_y(theta):  # Rotation around the y-axis
    theta = np.deg2rad(theta)
    return np.array(
        [
            [np.cos(theta), 0, np.sin(theta)],
            [0, 1, 0],
            [-np.sin(theta), 0, np.cos(theta)],
        ]
    )


def rotation_x(theta):  # Rotation around the x-axis
    theta = np.deg2rad(theta)
    return np.array(
        [
            [1, 0, 0],
            [0, np.cos(theta), -np.sin(theta)],
            [0, np.sin(theta), np.cos(theta)],
        ]
    )

Constructing World-to-Camera Extrinsic Parameters Manually¶

To achieve a more comprehensive view of the scene that departs from the top-down perspective, constructing extrinsic parameters manually becomes necessary. The process involves rotating the camera around the x-axis and setting a specific translation vector.

Step 1: Rotating the Camera Around the X-Axis

Rotating the camera around the x-axis alters the elevation angle, offering a different perspective of the scene. The rotation matrix for an angle $\theta$ around the x-axis can be represented as:

$$ R_x(\theta) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) \\ 0 & \sin(\theta) & \cos(\theta) \end{pmatrix} $$
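As a quick sanity check of this matrix (a minimal sketch; rotation_x mirrors the helper defined in the previous cell), rotating the +Y unit vector by 90 degrees about the x-axis should yield +Z:

```python
import numpy as np

def rotation_x(theta_deg):
    t = np.deg2rad(theta_deg)
    return np.array([
        [1, 0, 0],
        [0, np.cos(t), -np.sin(t)],
        [0, np.sin(t),  np.cos(t)],
    ])

# Rotating +Y by 90 degrees about the x-axis maps it to +Z.
v = rotation_x(90) @ np.array([0.0, 1.0, 0.0])
print(np.round(v, 6))  # [0. 0. 1.]
```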

Step 2: Defining the Translation Vector

After rotating the camera, setting the translation vector to (0, 0, 3) repositions the camera's perspective in relation to the scene. This translation means that a point located at (0, 0, 0) in world coordinates will be perceived as (0, 0, 3) in camera coordinates. Essentially, this moves the camera's viewpoint 3 units away from the origin along the z-axis.
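This mapping can be verified with a tiny NumPy sketch (PyTorch3D uses a row-vector convention, x_cam = x_world R + T; this is a simplified illustration, not the library's code):

```python
import numpy as np

R = np.eye(3)              # no rotation (theta_x = theta_z = 0)
T = np.array([0.0, 0.0, 3.0])

# Row-vector convention: x_cam = x_world @ R + T.
x_world = np.array([0.0, 0.0, 0.0])  # the world origin
x_cam = x_world @ R + T
print(x_cam)  # [0. 0. 3.]
```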

In [ ]:
# Define the angles
theta_z_degrees = [0, 90]
theta_x_degrees = [0, 90, 120]

T = torch.zeros(3).unsqueeze(dim=0)  # 1D to 2D
T[0, 2] = 3

fig, ax = plt.subplots(nrows=len(theta_z_degrees), ncols=len(theta_x_degrees), figsize=(20, 13))

for i, theta_z in enumerate(theta_z_degrees):
    for j, theta_x in enumerate(theta_x_degrees):
        rotation = np.dot(rotation_z(theta_z), rotation_x(theta_x))
        R = torch.tensor(rotation).to(device=device).unsqueeze(dim=0)
        cameras = FoVOrthographicCameras(
            R=R,
            T=T,
            min_x=-0.8,
            max_x=0.8,
            min_y=-0.8,
            max_y=0.8,
            znear=0.01,
            zfar=10,
            device=device,
        )
        rasterizer = PointsRasterizer(cameras=cameras, raster_settings=raster_settings)
        renderer = PointsRenderer(rasterizer=rasterizer, compositor=alpha_comp)

        images = renderer(pointcloud)
        img = images[0, ..., :3].cpu().numpy()
        print(f"The image's shape: {img.shape}, type: {img.dtype}, min: {img.min()}, max:{img.max()}")
        ax[i, j].set_title(f"X-axis rotation: {theta_x}, Z-axis rotation: {theta_z}", fontsize=12)
        ax[i, j].imshow(img)
        ax[i, j].axis("off")

fig.tight_layout()
plt.show()
The image's shape: (512, 512, 3), type: float32, min: 8.101465937215835e-05, max:0.9925998449325562
The image's shape: (512, 512, 3), type: float32, min: 6.249735452001914e-05, max:0.9961091876029968
The image's shape: (512, 512, 3), type: float32, min: 2.5688828827696852e-05, max:0.9961092472076416
The image's shape: (512, 512, 3), type: float32, min: 8.101465937215835e-05, max:0.9925998449325562
The image's shape: (512, 512, 3), type: float32, min: 3.453396857366897e-05, max:0.991234302520752
The image's shape: (512, 512, 3), type: float32, min: 5.847863576491363e-06, max:0.9944634437561035
(Rendered output image)

Discussion:

The top-left image may seem like we are viewing the castle from above, but in reality, the perspective is from below, directed towards the +Z direction. This understanding aligns with the earlier explanation that the castle's apex points towards the +Z direction.

Consequently, to observe the castle's surface directly, the camera must rotate more than 90 degrees around the x-axis. However, it is difficult to determine whether this rotation should be 90 or -90 degrees, especially since the point cloud itself remains static and it is the camera that moves.
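The sign ambiguity can be seen directly from the rotation matrices: +90 and -90 degree rotations about the x-axis send the castle's up axis (+Z) to opposite directions (rotation_x mirrors the helper defined earlier):

```python
import numpy as np

def rotation_x(theta_deg):
    t = np.deg2rad(theta_deg)
    return np.array([
        [1, 0, 0],
        [0, np.cos(t), -np.sin(t)],
        [0, np.sin(t),  np.cos(t)],
    ])

up = np.array([0.0, 0.0, 1.0])          # the castle's up axis
up_pos = rotation_x(90) @ up
up_neg = rotation_x(-90) @ up
print(np.round(up_pos, 6))   # [ 0. -1.  0.]
print(np.round(up_neg, 6))   # [ 0.  1.  0.]
```

Both rotations tilt the camera so the facade becomes visible, but they flip the image vertically relative to each other.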

Now, please inspect every image to see if the rendering corresponds accurately with the specified settings of $\theta_x$ and $\theta_z$.

In [ ]:
# Set batch size - this is the number of different viewpoints from which we want to render the mesh.
batch_size = 120
angles = torch.linspace(0, 360, batch_size)

theta_x = 120

images = []
for ang in angles:
    rotation = np.dot(rotation_z(ang), rotation_x(theta_x))
    R = torch.tensor(rotation).to(device=device).unsqueeze(dim=0)

    camera = FoVOrthographicCameras(
        R=R,
        T=T,
        min_x=-0.8,
        max_x=0.8,
        min_y=-0.8,
        max_y=0.8,
        znear=0.01,
        zfar=10,
        device=device,
    )

    rasterizer = PointsRasterizer(cameras=camera, raster_settings=raster_settings)
    renderer = PointsRenderer(rasterizer=rasterizer, compositor=alpha_comp)

    img = renderer(pointcloud).cpu().numpy()[0, ..., :3]
    img *= 255
    images.append(img.astype(np.uint8))

imageio.mimsave(OUTPUT_DIR / "Palac_Moszna_pointcloud.gif", images, fps=12)

torch.cuda.empty_cache()
gc.collect()
Out[ ]:
20221

Voxel Representation¶

Voxels, short for volumetric pixels, represent values on a regular grid in three-dimensional space. Unlike pixels, which cover 2D areas, voxels extend into 3D, adding depth. In its simplest binary form, each voxel holds one of two states, 0 or 1, indicating whether that cell is occupied. This binary voxel-grid approach simplifies modeling the presence or absence of material in 3D environments.

Beyond a binary state, voxels can hold richer data such as density and color, enhancing the detail and realism of 3D models. Here, both density and color values are normalized to the range [0, 1], where 0 represents the absence of material or color and 1 signifies maximum density or full color intensity.
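A binary occupancy grid can be sketched in a few lines of NumPy (a toy 4x4x4 grid, not the cow data used below):

```python
import numpy as np

# A tiny 4x4x4 binary occupancy grid: 1 = filled, 0 = empty.
grid = np.zeros((4, 4, 4), dtype=np.uint8)
grid[1:3, 1:3, 1:3] = 1  # a 2x2x2 solid block in the centre

print(grid.sum())   # 8 occupied voxels
print(grid.mean())  # 0.125 — the fraction of the volume that is filled
```

Replacing the 0/1 values with floats in [0, 1] turns this into a density grid, and stacking three such grids yields per-voxel RGB.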

In PyTorch3D, the Volumes class facilitates the creation of a batch of volume objects, utilizing densities and features as its primary parameters. The densities parameter is designed to encapsulate different measures of volume cell "density," which can include concepts like opacity or signed/unsigned distances from the nearest surface. In the context of our cow voxels, densities are used to indicate these various opacity measures. The features parameter, on the other hand, typically carries additional attributes such as colors.

Note: I've previously executed the fit_textured_volume.ipynb notebook to create voxel data for a cow model, which is now stored in "cow_mesh/cow_voxel_grid.npz". This tutorial will concentrate on the rendering of voxels, bypassing the model fitting process.

In [ ]:
voxel_file = DATA_DIR / "cow_mesh/cow_voxel_grid.npz"

# Load the voxel grid.
# The densities and colors have already been passed through a sigmoid, so they lie in (0, 1).
voxels = np.load(voxel_file)
densities = voxels["densities"]
colors = voxels["colors"]
voxelsize = voxels["voxelsize"]  # 0-d array; wrapped in a tensor below
render_size = voxels["render_size"].item()
volume_extent_world = voxels["volume_extent_world"].item()

densities = torch.Tensor(densities).to(device)
colors = torch.Tensor(colors).to(device)
voxelsize = torch.Tensor(voxelsize)

print(densities.shape, densities.min(), densities.max())
print(colors.shape, colors.min(), colors.max())
print("voxelsize: ", voxelsize)
print("volume_extent_world: ", volume_extent_world)


image_size = 512
raysampler = NDCMultinomialRaysampler(
    image_width=image_size,
    image_height=image_size,
    n_pts_per_ray=350,  # larger values cause an out-of-memory error
    # n_rays_per_image=
    # n_rays_total=
    min_depth=0.01,
    max_depth=volume_extent_world,
)
renderer = VolumeRenderer(
    raysampler=raysampler,
    raymarcher=EmissionAbsorptionRaymarcher(),
)
R, T = look_at_view_transform(dist=2.7, elev=0, azim=180)
camera = FoVPerspectiveCameras(
    R=R,
    T=T,
    znear=0.01,
    zfar=100,
    aspect_ratio=1,
    fov=50,
    device=device,
)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 15))
for i, density_factor in enumerate([0.1, 1]):
    volumes = Volumes(
        densities=[density_factor * densities],
        features=[colors],
        voxel_size=voxelsize,
    )
    rendered_images, rendered_silhouettes = renderer(cameras=camera, volumes=volumes)
    img = rendered_images[0, ..., :3].cpu().numpy()
    ax[i].set_title(f"density_factor: {density_factor}", fontsize=12)
    ax[i].imshow(img)
    ax[i].axis("off")

fig.tight_layout()
plt.show()

torch.cuda.empty_cache()
gc.collect()
torch.Size([1, 128, 128, 128]) tensor(9.6917e-05, device='cuda:0') tensor(0.9929, device='cuda:0')
torch.Size([3, 128, 128, 128]) tensor(0.0013, device='cuda:0') tensor(0.9988, device='cuda:0')
voxelsize:  tensor(0.0234)
volume_extent_world:  3.0
(Rendered output image)
Out[ ]:
40

Configuring Volume Rendering Settings¶

Parameters of NDCMultinomialRaysampler:

  • image_width / image_height: Specify the dimensions of the output image in pixels (both set to image_size above). They determine the resolution of the ray grid.
  • n_pts_per_ray: The number of points to sample along each ray. More points allow for a more detailed rendering of the volume but increase computation time.
  • min_depth: The minimum depth, in world units along each ray, at which sampling starts. This clips the volume space at the near end.
  • max_depth: The maximum depth, in world units, up to which each ray is sampled, clipping the volume space at the far end.

NOTE: At the moment, the VolumeRenderer does not support direct lighting.
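The EmissionAbsorptionRaymarcher composites the samples along each ray with emission-absorption weighting. A minimal NumPy sketch of the idea on a single ray (a simplified illustration with made-up values, not the library's exact implementation):

```python
import numpy as np

# Per-sample opacity and color along one ray, front to back (toy values).
densities = np.array([0.1, 0.5, 0.9])
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])

# Transmittance: the fraction of light surviving all earlier samples.
transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - densities]))[:-1]
weights = densities * transmittance         # each sample's contribution
pixel = (weights[:, None] * colors).sum(axis=0)
alpha = weights.sum()                        # rendered silhouette value
print(weights, pixel, alpha)
```

Dense samples near the front absorb light and shadow everything behind them, which is why scaling densities by density_factor changes the apparent opacity of the cow above.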

In [ ]:
image_size = 512

raysampler = NDCMultinomialRaysampler(
    image_width=image_size,
    image_height=image_size,
    n_pts_per_ray=400,  # larger values cause an out-of-memory error
    min_depth=0.01,
    max_depth=volume_extent_world,
)
raymarcher = EmissionAbsorptionRaymarcher()
renderer = VolumeRenderer(
    raysampler=raysampler,
    raymarcher=raymarcher,
)

elevations = [-30, 0, 30]
view_distances = [2, 2.7, 3.5]

fig, ax = plt.subplots(
    nrows=len(elevations), ncols=len(view_distances), figsize=(15, 15)
)

for i, dist in enumerate(view_distances):
    for j, elev in enumerate(elevations):
        R, T = look_at_view_transform(dist=dist, elev=elev, azim=180)
        camera = FoVPerspectiveCameras(
            R=R,
            T=T,
            znear=0.01,
            zfar=100,
            aspect_ratio=1,
            fov=50,
            device=device,
        )

        rendered_images, rendered_silhouettes = renderer(
            cameras=camera, volumes=volumes
        )

        img = rendered_images[0, ..., :3].cpu().numpy()
        ax[i, j].set_title(f"view_distances: {dist}, Elevations: {elev}", fontsize=12)
        ax[i, j].imshow(img)
        ax[i, j].axis("off")

fig.tight_layout()
plt.show()
(Rendered output image)

Discussion:

In the last row of images, we can observe that voxels located at the rear of the observed volume are not rendered or are "cut off" from the viewer's perspective.

The max_depth parameter in volume rendering settings defines the maximum distance from the viewpoint (or camera) at which rays will sample the volume. Essentially, it sets a boundary in the depth of the scene beyond which no sampling occurs. When the max_depth is too small relative to the observer's viewing distance, it restricts the depth range that the rays can explore. As a result, parts of the volume that lie beyond this max_depth threshold are not considered in the final rendered image.
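The clipping effect can be sketched numerically: with the settings above, ray sample depths never exceed max_depth, so geometry beyond it simply cannot be hit (a simplified uniform-sampling sketch, not the multinomial sampler itself):

```python
import numpy as np

min_depth, max_depth, n_pts = 0.01, 3.0, 350

# Sample depths are confined to [min_depth, max_depth]; with the camera at
# dist = 3.5, the rear of the volume lies past max_depth and is never sampled.
depths = np.linspace(min_depth, max_depth, n_pts)
print(depths.min(), depths.max())  # 0.01 3.0
```

Raising max_depth to cover the full camera distance plus the volume extent restores the cut-off region, at the cost of spreading the same number of samples over a longer ray.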

In [ ]:
raysampler = NDCMultinomialRaysampler(
    image_width=image_size,
    image_height=image_size,
    n_pts_per_ray=350,  # larger values cause an out-of-memory error
    min_depth=0.01,
    max_depth=5,
)
renderer = VolumeRenderer(
    raysampler=raysampler,
    raymarcher=EmissionAbsorptionRaymarcher(),
)

batch_size = 120
azimuths = torch.linspace(-180, 180, batch_size)

images = []
for azim in azimuths:
    R, T = look_at_view_transform(dist=3, elev=0, azim=azim)
    camera = FoVPerspectiveCameras(
        R=R,
        T=T,
        znear=-10,
        zfar=100,
        aspect_ratio=1,
        fov=50,
        device=device,
    )

    rendered_images, rendered_silhouettes = renderer(cameras=camera, volumes=volumes)
    img = rendered_images[0, ..., :3].cpu().numpy()
    img *= 255
    images.append(img.astype(np.uint8))

imageio.mimsave(OUTPUT_DIR / "cow_voxel.gif", images, fps=12)

torch.cuda.empty_cache()
gc.collect()
Out[ ]:
5155